Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZooKeeper to KRaft migration #9324

Closed
wants to merge 1 commit into from

Conversation

ppatierno
Copy link
Member

@ppatierno ppatierno commented Nov 6, 2023

Type of change

  • Enhancement / new feature

Description

This PR fixes #9433 about implementing the ZooKeeper to KRaft migration process.

The current status is about starting from a ZooKeeper-based cluster, creating a node pool with controllers, applying the strimzi.io/kraft: migration annotation on the Kafka CR and allowing the operator FSM to start the migration up to the "dual-write" (so creating controllers, rolling brokers, start migration). Then the user can apply the strimzi.io/kraft: enabled annotation in order to finilize the process by having the full KRaft-based cluster,
Different things are still missing:

  • setting some warnings in the Kafka CR status when annotation is applied in non valid states;
  • checking if readiness/liveness scripts need to evaluate the node role other than the migration state. Maybe it won't be fixed until the new KRaft readiness/liveness will land into the codebase;
  • checking the values for KAFKA_READY and ZK_CONNECTED env var in the Kafka run script;
  • the useKRaft flag passed around to the KafkaCluster is now considered "just" the value of the feature gate (it doesn't take into account strimzi.io/kraft: enabled annotation anymore, because now KRaft can be early enabled during migration). Maybe I should use anyway this flag together with all other checks? Not sure ...
  • unit tests and integration tests for new additions not developed yet;
  • logic in the Kafka agent able to get metrics about migration allowing the FSM to know when migration ends and move on to the next phase (which should already work because it would be just a matter of changing the strimzi.io/kraft annotation to enabled and allowing the operator to update nodes configuration and rolling them). NOTE: rolling controllers is a blocker due to Add support for KRaft in KafkaRoller #9146)

The PR won't address documentation. There will be a different PR for it.

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md

@ppatierno ppatierno changed the title First part on ZK-KRaft migration, from ZK to dual write ZooKeeper to KRaft migration Nov 7, 2023
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments from an initial pass ...

@ppatierno ppatierno force-pushed the zk-kraft-migration branch 4 times, most recently from 25e87c6 to 8edb152 Compare November 23, 2023 17:10
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppatierno, I left some comments, but looks good.

I also tried to run a migration after merging the Kafka Roller changes. It worked fine until KRaftDualWriting state, where I got this error when it was trying to roll the controllers.

2023-11-23 18:09:04 WARN  KafkaQuorumCheck:65 - Reconciliation #173(timer) Kafka(test/my-cluster): Error determining the controller quorum leader id
org.apache.kafka.common.errors.UnsupportedVersionException: The broker does not support DESCRIBE_QUORUM

After manually restarting the controllers one by one, the migration completed successfully. I verified that broker were connected to the quorum and working.

NAME         DESIRED KAFKA REPLICAS   DESIRED ZK REPLICAS   READY   METADATA STATE   WARNINGS
my-cluster   3                        3                                              
my-cluster   3                        3                     True    ZooKeeper        
my-cluster   3                        3                     True    ZooKeeper        
my-cluster   3                        3                                              
my-cluster   3                        3                                              
my-cluster   3                        3                     True    KRaftMigration   
my-cluster   3                        3                     True    KRaftDualWriting   
my-cluster   3                        3                     True    KRaftDualWriting   
my-cluster   3                        3                                                
my-cluster   3                        3                                                
my-cluster   3                        3                     True    KRaft  

I'm also attaching the CO log for this test.
zk2kraft.log

}

/**
* Compute the next desired nodes configuration related state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Compute the next desired nodes configuration related state
* Compute the next state for the metadata configuration to be applied to Kafka nodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this ... it's more the "node configuration" (so if ZooKeeper and/or KRaft are needed) than the "metadata configuration". I mean from the metadata state, we are getting which node configuration we need, and this is represented by a "state". Maybe the overall naming is bad :-D

@@ -286,7 +300,9 @@ public Future<Void> reconcile(KafkaStatus kafkaStatus, Clock clock) {
// This has to run after all possible rolling updates which might move the pods to different nodes
.compose(i -> nodePortExternalListenerStatus())
.compose(i -> addListenersToKafkaStatus(kafkaStatus))
.compose(i -> updateKafkaVersion(kafkaStatus));
.compose(i -> updateKafkaVersion(kafkaStatus))
.compose(i -> maybeKafkaMetadataMigrationInProgress())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just use updateKafkaMetadataState, which also contains this logic. Wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by that? Can you elaborate a little bit more please?

@ppatierno
Copy link
Member Author

I also tried to run a migration after merging the Kafka Roller changes.

@fvaleri thanks for testing the two PRs working together!

It worked fine until KRaftDualWriting state, where I got this error when it was trying to roll the controllers.

So what you are seeing is expected and it's a problem for us. When it's the time to roll the controllers (so moving from the "dual write" mode to the full KRaft), the controllers (and the brokers as well) are still connected to ZooKeeper and in that state, the describeQuorum API doesn't work. This adds an additional complexity to the roller which needs to be able to determine the leader in a different way in such a case than using the describeQuorum method. @tinaselenge @katheris I think it's something to think about in your PR, otherwise it's not going to work with migration.

@ppatierno ppatierno force-pushed the zk-kraft-migration branch 3 times, most recently from d35682a to 9d7ecb4 Compare November 29, 2023 11:37
@ppatierno ppatierno added this to the 0.40.0 milestone Nov 29, 2023
@ppatierno ppatierno force-pushed the zk-kraft-migration branch 10 times, most recently from 3524cf3 to 634b897 Compare December 6, 2023 10:47
@ppatierno ppatierno force-pushed the zk-kraft-migration branch 3 times, most recently from 9c59528 to 7da7fda Compare December 12, 2023 13:16
Signed-off-by: Paolo Patierno <[email protected]>

Moved KafkaMetadataStateManager into the resource package

Signed-off-by: Paolo Patierno <[email protected]>

Added KafkaAgent endpoint to retrive ZooKeeper migration state metric
Updated KafkaAgent client to retrieve ZooKeeper migration state
Added logic to KafkaReconciler to check when migration ends

Signed-off-by: Paolo Patierno <[email protected]>

Fixed spotbugs issue

Signed-off-by: Paolo Patierno <[email protected]>

Moved KafkaMetadataConfigurationState in its own file
Exposed metadata configuration state evaluation in corresponding
KafkaMetadataConfigurationState instance methods
Using loop on controllers to get the ZooKeeper migration metrics
Other cleaning and refactor changes as per JS comments

Signed-off-by: Paolo Patierno <[email protected]>

Removed unused Random class

Signed-off-by: Paolo Patierno <[email protected]>

Moved the KafkaMetadataStateManager creation to the ReonciliationState
because needed in the assembly operator as well

Signed-off-by: Paolo Patierno <[email protected]>

Fixed not exposing control plane port 9090 on KRaft broker-only nodes
Fixed still exposing replication port 9091 on KRaft controllers during
migration

Signed-off-by: Paolo Patierno <[email protected]>

Fixed duplicate KRaft configuration section when mixed-node

Signed-off-by: Paolo Patierno <[email protected]>

Fixed tests

Signed-off-by: Paolo Patierno <[email protected]>

Fixed bug about keeping replication listener on controllers until
post-migration

Signed-off-by: Paolo Patierno <[email protected]>

Some refactoring (also based on FV comments)

Signed-off-by: Paolo Patierno <[email protected]>

Fixed validation for nodepools and features already in pre-migration

Signed-off-by: Paolo Patierno <[email protected]>

Addition on the configuration migration table
Removed not used method

Signed-off-by: Paolo Patierno <[email protected]>

Fixed checkstyle issue

Signed-off-by: Paolo Patierno <[email protected]>

Running spec checker validation only when migration happens to KRaft not
before

Signed-off-by: Paolo Patierno <[email protected]>

Fixed test

Signed-off-by: Paolo Patierno <[email protected]>

Updated liveness, readiness and run scripts to take process.roles into
account

Signed-off-by: Paolo Patierno <[email protected]>

Added warning conditions when strimzi.io/kraft annotation is not used in
the right way

Signed-off-by: Paolo Patierno <[email protected]>

Refactored the KafkaAgent and client in relation to the KRaft migration
endpoint

Signed-off-by: Paolo Patierno <[email protected]>

Added versions validation before starting a KRaft migration

Signed-off-by: Paolo Patierno <[email protected]>

Removed the useKRaft flag related to the feature gate flags
Making reconciliation failing fast when useKRaft feature flag is not
enabled
Updated related tests

Signed-off-by: Paolo Patierno <[email protected]>
@ppatierno
Copy link
Member Author

Closing this PR which is now replaced by #9480 having same content but originating from the zk-kraft-migration branch living on the Stimzi official repo and not on a fork anymore.

@ppatierno ppatierno closed this Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[KRaft] ZooKeeper to KRaft migration
4 participants